Abstract


Producing specifications by dynamic (runtime) analysis of 
program executions is potentially unsound, because the ana- 
lyzed executions may not fully characterize all possible ex- 
ecutions of the program. In practice, how accurate are the 
results of a dynamic analysis? This paper describes the re- 
sults of an investigation into this question, comparing spec- 
ifications generalized from program runs with specifications 
verified by a static checker. The surprising result is that for a 
collection of modest programs, small test suites captured all 
or nearly all program behavior necessary for a specific type 
of static checking, permitting the inference and verification 
of useful specifications. For ten programs of 100-800 lines, 
the average precision, a measure of correctness, was 0.95 and
the average recall, a measure of completeness, was 0.94.

This is a positive result for testing, because it suggests 
that dynamic analyses can capture all semantic information 
of interest for certain applications. The experimental results 
demonstrate that a specific technique, dynamic invariant de- 
tection, is effective at generating consistent, sufficient spec- 
ifications. Finally, the research shows that combining static 
and dynamic analyses over program specifications has ben- 
efits for users of each technique. 


1 Introduction 


Dynamic (runtime) analysis obtains information from pro- 
gram executions; examples include profiling and testing. 
Rather than modeling the state of the program, dynamic 
analysis uses actual values computed during program exe- 
cutions. Dynamic analysis can be efficient and precise, but 
the results may not generalize to future program executions. 
This unsoundness makes dynamic analysis inappropriate for 
certain uses, and it may make users reluctant to depend on 
the results even in other contexts because of uncertainty as 
to their reliability. 

By contrast, static analysis operates by examining pro- 
gram source code and reasoning about possible executions. 
It builds a model of the state of the program, such as values
for variables. Static analysis can be conservative and sound,
and it is theoretically complete [CC77]. However, it
can be inefficient, can produce weak results, and (as in the 
case of theorem-proving or program verification) can require 
explicit goals or annotations. 

We have integrated and compared static and dynamic 
analyses over program specifications in order to understand 
the relationships between them. In particular, our investi- 
gation provides preliminary answers to the following ques- 
tions. 


How accurate is dynamic analysis? We do not have a 
theoretical answer to this question, nor can we predict how 
useful analysis results will be. (In any event, the answer 
depends on the particular use and user.) However, our ex- 
periments provide an interesting datapoint for the specific 
example of program specifications. Specifications form a 
particularly rich domain that captures a great deal of what is 
interesting about a program’s semantics, and we show that a 
dynamic analysis can recover them accurately. 


How can dynamic analysis be improved? The accuracy 
of a dynamic analysis can be improved in at least three ways. 
First, the dynamic analysis itself can be made more discrimi- 
nating; we show that our specification inference analysis and 
its implementation are effective. Second, the dynamic analy- 
sis can be integrated with other analyses. For instance, pass- 
ing potentially unsound output through a checker to remove 
unverifiable properties improves soundness while possibly 
reducing completeness. We have implemented and evaluated 
such a system. (The checker used by our implementation is 
unsound, but it nevertheless is of substantial benefit; its se- 
lection was an engineering tradeoff.) Third, feedback from 
the dynamic analysis can indicate how to improve test suites. 
Feedback about properties (not) satisfied may be at least as 
effective as code coverage feedback about lines (not) exe- 
cuted. This paper does not directly address such feedback, 
however. 


How can dynamic analysis be used despite unsoundness? 
A dynamic analysis might produce results that are correct 
over all possible executions. If the results can be verified,
then they can be used as if they resulted from a sound analysis.
Our techniques produce fully verifiable results in many
circumstances, but even less than perfect results can be of 
use. For instance, selecting and expressing goals for static 
verification can be difficult and tedious, and current sys- 
tems have trouble postulating them. Starting from partial or 
nearly-true specifications could be easier for various tasks, 
including program verification, than starting from no speci- 
fications at all. Tool support for generating specifications has 
the potential to ease use of formal methods, enabling them 
to become more practical and more widely used. We provide 
preliminary evidence to support this claim. 


Our results demonstrate that much of program semantics
is present in test executions, as measured against verifiability
of generated specifications. They also demonstrate that
the technique of dynamic invariant detection is effective in 
capturing this information, and that the results are effective 
for the task of verifying absence of runtime errors. Finally, 
they show that static and dynamic analyses can be integrated 
to overcome the shortcomings of each: unsoundness for the 
dynamic analysis and lack of goals or tedious annotation for 
the static analysis. 


1.1 Approach 


We used program specifications to investigate the rela- 
tionship between dynamically and statically available in- 
formation about a program, and the accuracy of the for- 
mer. Our approach is to extract specifications from program 
runs [Ern00, ECGN01] and determine whether they are correct
and sufficient. For the purposes of this paper, our sufficiency
measure is machine verifiability of the specifications.
Correct specifications may be insufficient if limitations of 
the verifier prevent them from being proven. 

The generated specifications are program invariants. 
These specifications are partial: they describe and constrain 
behavior but do not provide a full input-output mapping. 
The specifications are also unsound: as described later, the 
properties are likely, but not guaranteed, to hold. 

A program invariant is a property that is true (or putatively
true) at a particular program point or points, such as
might appear in an assert statement or a formal specification.
Invariants include procedure preconditions and postconditions,
loop invariants, and object (representation) invariants.
Examples include y = 4 * x + 3; x > abs(y); array a contains
no duplicates; n = n.child.parent (for all nodes n);
size(keys) = size(contents); and graph g is acyclic. Invariants
explicate data structures and algorithms and are helpful for
programming tasks from design to maintenance. Invariants
assist in creation of better programs [Gri81, LG86, HHJ+87b,
HHJ+87a], document program operation [KL86, LCKS90], assist
testing and enable correct modification [OC89, GKMS00],
assist in test-case generation [TCMM98] and validation [CR99],
form a program spectrum [AFMS96, RBDL97, HRWY98], and can
enable optimizations [CFE99], among other uses. Despite their
advantages, invariants are usually not stated explicitly in
programs.


Figure 1: Generation and checking of program specifications
results in a specification together with a proof of its correctness.
Our generator is the Daikon invariant detector, and our checker
is the ESC/Java static checker.


Dynamic invariant detection is a technique for postulat- 
ing likely invariants from program runs: a dynamic invariant 
detector runs the target program, examines the values that it 
computes, and looks for patterns and relationships over those 
values, reporting the ones that are always true over an entire 
test suite and that satisfy certain other conditions (see Sec- 
tion 2.1). The outputs are likely invariants: they are not guar- 
anteed to be universally true, because the test suite might not 
characterize all possible executions of the program. 

To explore the issues listed above, we have integrated a 
dynamic invariant detector, Daikon [Ern00, ECGN01], with
a static verifier, ESC/Java [DLNS98, LNS00]. Our system
operates in three steps (see Figure 1) [NE01]. First, it runs
Daikon, which outputs a list of likely invariants obtained 
from running the target program over a test suite. (We use 
the term “test suite” for any inputs over which executions are 
analyzed; those inputs need not satisfy any particular prop- 
erties regarding code coverage or bug detection.) Second, 
it inserts those likely invariants into the target program as 
annotations. Third, it runs ESC/Java on the annotated tar- 
get program to report which of the likely invariants can be 
statically verified and which cannot. All three steps are com- 
pletely automatic, though users may provide guidance in or- 
der to obtain better results if desired. Users may edit and 
re-run test suites, or may add or remove specific program 
annotations by hand. 

The remainder of this paper is organized as follows. Sec- 
tion 2 provides background on the dynamic invariant detec- 
tor and static verifier used by our system. Section 3 presents 
results from several experiments. Section 4 notes challenges 
that arose while building and running our system. Section 5 
discusses lessons learned from the experiments. Finally, 
Section 6 relates our results to other research, Section 7 pro- 
poses follow-on research, and Section 8 concludes. 


2 Background 


This section describes dynamic detection of program invariants,
as performed by the Daikon tool, and static checking
of program annotations, as performed by the ESC/Java tool.
Full details about the techniques and tools appear elsewhere.


Figure 2: An overview of dynamic detection of invariants as
implemented by Daikon.


2.1 Daikon: Invariant discovery 


Dynamic invariant detection [Ern00, ECGN01] discovers
likely invariants from program executions by instrumenting 
the target program to trace the variables of interest, running 
the instrumented program over a test suite, and inferring in- 
variants over the instrumented values (Figure 2). The infer- 
ence step tests a set of possible invariants against the val- 
ues captured from the instrumented variables; those invari- 
ants that are tested to a sufficient degree without falsification 
are reported to the programmer. As with other dynamic ap- 
proaches such as testing and profiling, the accuracy of the 
inferred invariants depends in part on the quality and com- 
pleteness of the test cases. The Daikon invariant detector is 
language independent, and currently includes instrumenters 
for C++ and Java. 

Daikon detects invariants at specific program points such 
as procedure entries and exits; each program point is treated 
independently. The invariant detector is provided with a 
variable trace that contains, for each execution of a program 
point, the values of all variables in scope at that point. Each 
of a set of possible invariants is tested against various com- 
binations of one, two, or three traced variables. 

For scalar variables x, y, and z, and computed constants
a, b, and c, some examples of checked invariants are:
equality with a constant (x = a) or a small set of constants
(x ∈ {a, b, c}), lying in a range (a ≤ x ≤ b), non-zero,
modulus (x ≡ a (mod b)), linear relationships
(z = ax + by + c), ordering (x ≤ y), and functions
(y = fn(x)). Invariants involving a sequence variable
(such as an array or linked list) include minimum and maximum
sequence values, lexicographical ordering, element ordering,
invariants holding for all elements in the sequence,
or membership (x ∈ y). Given two sequences, some example
checked invariants are elementwise linear relationship,
lexicographic comparison, and subsequence relationship.

In addition to locally-checkable invariants such as node
= node.child.parent (for all nodes), Daikon detects global
invariants over pointer-directed data structures, such as
mytree is sorted by ≤, by linearizing graph-like data structures.
Finally, Daikon can detect conditional invariants
such as “if p ≠ null then p.value > x” and “p.value > limit or
p.left ∈ mytree”. Conditional invariants result from splitting
data into parts based on the condition and comparing the resulting
invariants; if the invariants in the two halves differ,
they are composed into a conditional invariant [EGKN99].

For each variable or tuple of variables in scope at a given 
program point, each potential invariant is tested. Each po- 
tential unary invariant is checked for all variables, each po- 
tential binary invariant is checked over all pairs of variables, 
and so forth. A potential invariant is checked by examin- 
ing each sample (i.e., tuple of values for the variables being 
tested) in turn. As soon as a sample not satisfying the invari- 
ant is encountered, that invariant is known not to hold and is 
not checked for any subsequent samples. Daikon maintains 
acceptable performance as program size increases because 
false invariants tend to be falsified quickly, so the cost of de- 
tecting invariants tends to be proportional to the number of 
invariants discovered. All the invariants are inexpensive to 
test and do not require full-fledged theorem-proving. 
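
The falsification loop described above can be sketched in a few lines of Java. This is a minimal illustration, not Daikon's implementation: the class name InvariantSketch, the method survivors, and the three candidate properties are hypothetical, and the statistical justification test described in the next paragraph is omitted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical sketch of candidate-invariant falsification over pairs (x, y).
public class InvariantSketch {

    /** Returns the names of the candidates that no sample falsified. */
    public static List<String> survivors(
            Map<String, BiPredicate<Integer, Integer>> candidates,
            int[][] samples) {
        // Copy so that the caller's map is not modified.
        Map<String, BiPredicate<Integer, Integer>> live =
                new LinkedHashMap<>(candidates);
        for (int[] s : samples) {
            // A falsified candidate is dropped here and is never
            // tested against any subsequent sample.
            live.values().removeIf(p -> !p.test(s[0], s[1]));
        }
        return new ArrayList<>(live.keySet());
    }

    public static void main(String[] args) {
        Map<String, BiPredicate<Integer, Integer>> candidates =
                new LinkedHashMap<>();
        candidates.put("x == y", (x, y) -> x.equals(y));
        candidates.put("x <= y", (x, y) -> x <= y);
        candidates.put("y == 2*x + 1", (x, y) -> y == 2 * x + 1);

        // Each row is one sample: the values of (x, y) at a program point.
        int[][] samples = { {0, 1}, {1, 3}, {4, 9} };

        // Prints [x <= y, y == 2*x + 1]; "x == y" dies on the first sample.
        System.out.println(survivors(candidates, samples));
    }
}
```

Because a false candidate is removed at its first counterexample, the work done on later samples is spent only on candidates that are still alive, which is the behavior the paragraph above attributes to Daikon.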

An invariant is reported only if there is adequate statistical 
evidence for it. In particular, if there are an inadequate num- 
ber of observations, observed patterns may be mere coin- 
cidence. Consequently, for each detected invariant, Daikon 
computes the probability that such a property would appear 
by chance in a random set of samples. The property is re- 
ported only if its probability is smaller than a user-defined 
confidence parameter [ECGN00].

The Daikon invariant detector is available from
http://sdg.lcs.mit.edu/daikon/.


2.2 ESC: Static checking 


ESC [Det96, DLNS98, LN98] is an Extended Static Checker 
that has been implemented for Modula-3 and Java. It stat- 
ically detects common errors that are usually not detected 
until run time, such as null dereference errors, array bounds 
errors, and type cast errors. 

ESC is intermediate in both power and ease of use be- 
tween typecheckers and theorem-provers, but it aims to be 
more like the former and is lightweight by comparison with 
the latter. Rather than proving complete program correct- 
ness, ESC detects only certain types of errors. Programmers 
must write program annotations, many of which are similar 
in flavor to assert statements, but they need not interact 
with the checker as it processes the annotated program. ESC 
issues warnings about annotations that cannot be verified and 
about potential run-time errors. 

ESC performs modular checking: it checks different parts 
of a program independently and can check partial programs 
or modules. It assumes that specifications for missing or 
unchecked components are correct. ESC’s implementation 
uses a theorem-prover internally. We will not discuss ESC’s 
checking strategy in more detail because this research treats 
ESC as a black box. (It is distributed in binary form.) 

ESC/Java is a successor to ESC/Modula-3. ESC/Java’s
annotation language (see Section 4.1) is simpler, because it
is slightly weaker. This is in keeping with the philosophy of
a tool that is easy to use and useful to programmers rather
than one that is extraordinarily powerful but so difficult to
use that programmers shy away from it.


Program            Description
StackAr            stack represented by an array
QueueAr            queue represented by an array
DisjSets           disjoint sets supporting union, find
Vector             java.util.Vector growable array
StreetNumberSet    collection of numeric ranges
GeoSegment         pair of points on the earth
Graph              generic graph data structure
RatNum             rational number
RatPoly            polynomial over rational numbers
FixedSizeSet       set represented by a bitvector

Figure 3: Description of the analyzed programs. These programs
are available from the authors.


ESC is not sound; for instance, it does not model arith- 
metic overflow, and permits the user to supply (unverified) 
assumptions. However, ESC provides a good approximation 
to soundness. 

This paper uses ESC/Java not only as a lightweight tech- 
nology for detecting a restricted class of runtime errors, but 
also as a tool for verifying representation invariants and 
method specifications. We chose to use ESC/Java because 
we are not aware of other equally capable technology for 
statically checking properties of runnable code. Whereas 
many other verifiers operate over non-executable specifica- 
tions or models, our research aims to compare and combine 
dynamic and static techniques over the same code artifact. 

Both versions of ESC are publicly available from
http://research.compaq.com/SRC/esc/.


3 Experiments 


This section gives quantitative and qualitative results from a 
number of experiments. Results demonstrate that for certain 
programs, our system is able to infer specifications that are 
often precise and complete enough to be machine verifiable. 

Section 3.1 presents our methodology. Sections 3.2 
and 3.3 discuss two example programs in detail; these sec- 
tions characterize the generated specifications and provide 
an intuition about the output of our system. Section 3.4 
overviews other experiments and highlights the types of 
problems the system may encounter. 


3.1 Methodology 


We analyzed the programs listed in Figure 3. (Figure 4 sum- 
marizes the results.) The first three programs come from 
a data structures textbook [Wei99]; Vector is part of the 
Java standard library [Bla]; and the last six programs are 
staff solutions to assignments in a programming course at 
MIT [MIT01].


All of the programs except Vector came with test suites, 
either from the textbook or that were used for grading. Sev- 
eral of these test suites were small unit tests that contained 
just three or four calls per method and did not exercise the 
program’s full functionality. We extended the deficient test 
suites, an easy task (see Section 4.4). We wrote our own test 
suite for Vector. 


As described in Section 1.1, our system runs Daikon and 
inserts its output into the target program as ESC/Java an- 
notations. Some of Daikon’s invariants are inexpressible in 
ESC/Java’s notation (the “Inexpr” column of Figure 4; also 
see Section 4.1). We did not study these further. 


We determined by hand how many of Daikon’s invari- 
ants were redundant because they were logically implied by 
other invariants (the “Redund” column of Figure 4). We en- 
sured that redundant invariants verified exactly when their 
non-redundant counterparts did. We removed all of these in- 
variants from further consideration, for two reasons. First, 
Daikon attempts to avoid reporting redundant invariants, but 
its tests are not perfect; these results indicate what an im- 
proved tool could achieve. More importantly, only one re- 
dundant invariant did not verify, so including redundant in- 
variants would have inflated our results. Users would not 
need to remove the redundant invariants in order to use the 
tool. 


We then measured how different the reported invariants 
are from a set of annotations that ESC/Java can verify (while 
also verifying that no run-time errors occur). There are po- 
tentially many such verifiable sets. For instance, one set of 
annotations might only ensure that no run-time errors occur, 
while another set might also ensure that a representation in- 
variant is maintained. We selected as our goal set the one that 
required the smallest number of annotations to be added to 
or removed from the set that Daikon reported. This is a mea- 
sure of how different the reported invariants are from a set 
that is both consistent and sufficient for ESC/Java’s check- 
ing — an objective measure of how much of the semantics of 
the program was captured by Daikon from the program exe- 
cutions. It is also a measure of programmer effort to verify 
the program with ESC/Java, starting from a set of invariants 
detected by Daikon. One potential source of error is that 
we selected the goal set of annotations by hand; it is possi- 
ble that we overlooked a closer goal. However, the numbers 
we present are a pessimistic bound, because any such error 
would degrade them. 


Given the set of reported invariants and the goal set, we 
counted the number of invariants in both sets (the “Verif” 
column of Figure 4), the number only reported by Daikon 
(the “Unver” column), and the number only in the goal set 
(the “Miss” column). We computed precision and recall 
based on these three numbers. 
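
The counting and the two ratios can be sketched as follows, using the definitions given in the caption of Figure 4. This is only an illustration: the class name PrecisionRecall and the example invariant strings are hypothetical, and the counts below are not taken from Figure 4. The reported set is assumed to already exclude inexpressible and redundant invariants.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: precision and recall of reported invariants
// measured against a hand-selected goal set.
public class PrecisionRecall {

    /** Invariants in both sets: the "Verif" count. */
    static int verified(Set<String> reported, Set<String> goal) {
        Set<String> both = new HashSet<>(reported);
        both.retainAll(goal);
        return both.size();
    }

    /** Verif / (Verif + Unver), where Verif + Unver = |reported|. */
    public static double precision(Set<String> reported, Set<String> goal) {
        return (double) verified(reported, goal) / reported.size();
    }

    /** Verif / (Verif + Miss), where Verif + Miss = |goal|. */
    public static double recall(Set<String> reported, Set<String> goal) {
        return (double) verified(reported, goal) / goal.size();
    }

    public static void main(String[] args) {
        // Made-up data: one reported property is a test suite artifact,
        // and two goal annotations were never reported.
        Set<String> reported = new HashSet<>(Arrays.asList(
                "theArray != null", "topOfStack >= -1",
                "topOfStack <= theArray.length-1", "capacity == 10"));
        Set<String> goal = new HashSet<>(Arrays.asList(
                "theArray != null", "topOfStack >= -1",
                "topOfStack <= theArray.length-1",
                "capacity >= 0", "theArray.owner == this"));

        System.out.println("precision = " + precision(reported, goal)); // 3 of 4
        System.out.println("recall    = " + recall(reported, goal));    // 3 of 5
    }
}
```

The sketch assumes both sets are non-empty; with an empty reported or goal set the corresponding ratio is undefined.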
Figure 4: Summary of invariants detected by Daikon and verified by ESC/Java. The programs are described in Figure 3. “LOC” is the total
lines of code. “NCNB” is the non-comment, non-blank lines of code. “Meth” is the number of methods. “Verif” is the number of reported
invariants that ESC/Java verified. “Unver” is the number of reported invariants that ESC/Java failed to verify. “Inexpr” is the number of
reported invariants that were inexpressible in ESC/Java’s annotation language. “Redund” is the number of reported invariants that were
redundant (logically implied by other invariants). “Report” is the total number of reported invariants, the sum of the previous four
columns. “Miss” is the number of invariants not reported by Daikon but required by ESC/Java for verification. “Prec” is the precision of the
reported invariants, the ratio of verifiable to verifiable plus unverifiable invariants. “Recall” is the recall of the reported invariants, the ratio
of verifiable to verifiable plus missing.


3.2 StackAr: array-based stack 


The StackAr example is an array-based stack implementa- 
tion [Wei99]. The source contains 50 non-comment lines of 
code in 8 methods, along with comments that describe the 
behavior of the class but do not mention its representation 
invariant. 

Our system determined the representation invariant, 
method preconditions, modification targets, and postconditions,
and verified that these properties hold. The Daikon invariant
detector found 32 invariants, of which 25 were candidates
for verification. In addition, our system heuristically added
2 annotations involving aliasing of the array. 

Figure 5 shows part of the automatically-annotated source 
code for StackAr. The first six annotations describe the 
representation invariant, stating that the array index is legal 
and only unused storage is null. The next three annotations 
describe the specification for the constructor. Daikon also 
detects that after construction, all elements of the array are 
null, but this property is implied by the representation in- 
variant, so Daikon does not report the property and it is not 
included in the results. 

Our system generated specifications for all operations of 
the class, and verified that the implementation met the spec- 
ification. For example, a postcondition for the pop method 
was the bi-implication: 


(\old(topOfStack) == -1) == (\result == null)


This invariant states that the method returns null if and
only if the stack is empty upon entry. 
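
To make the bi-implication concrete, the following sketch checks it at run time on a small array-based stack. This is not the textbook StackAr code and not how ESC/Java works (ESC/Java proves the property statically): the class PopSketch and its method bodies are hypothetical, written only to show why the postcondition holds. It relies on the representation invariant that every stored element is non-null.

```java
// Hypothetical array-based stack that checks the pop postcondition at run time.
public class PopSketch {
    private final Object[] theArray;
    private int topOfStack = -1;

    public PopSketch(int capacity) {
        theArray = new Object[capacity];
    }

    public void push(Object x) {
        // Stored elements must be non-null; otherwise a null result from
        // pop would not imply that the stack was empty.
        if (x == null) throw new IllegalArgumentException("null element");
        theArray[++topOfStack] = x;
    }

    /** Returns the most recently pushed element, or null if the stack is empty. */
    public Object pop() {
        int oldTop = topOfStack;            // plays the role of \old(topOfStack)
        Object result = null;
        if (topOfStack != -1) {
            result = theArray[topOfStack];
            theArray[topOfStack--] = null;  // keep unused storage null
        }
        // Run-time analogue of (\old(topOfStack) == -1) == (\result == null).
        if ((oldTop == -1) != (result == null)) {
            throw new IllegalStateException("postcondition violated");
        }
        return result;
    }
}
```

Calling pop on a fresh PopSketch returns null; after a push, pop returns the pushed object. In both cases the check passes, matching the two directions of the bi-implication.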

Without these annotations, ESC/Java issues warnings
about many potential runtime errors, such as null dereferences
and array bounds errors. With the addition of the detected
invariants, ESC/Java issues no warnings, successfully
checks that the StackAr class avoids runtime errors, and
verifies that the implementation meets its generated specification.


public class StackAr
{
    //@ invariant theArray != null
    //@ invariant \typeof(theArray) == \type(Object[])
    //@ invariant topOfStack >= -1
    //@ invariant topOfStack <= theArray.length-1
    /*@ invariant (\forall int i; (0 <= i &&
          i <= topOfStack) ==> (theArray[i] != null)) */
    /*@ invariant (\forall int i; (topOfStack+1 <= i &&
          i <= theArray.length-1) ==> (theArray[i] == null)) */

    public StackAr( int capacity )
    //@ requires capacity >= 0
    //@ ensures capacity == theArray.length
    //@ ensures topOfStack == -1
    {
        theArray = new Object[ capacity ];
        topOfStack = -1;
        //@ set theArray.owner = this
    }

    /*@ spec_public */ private Object [ ] theArray;
    //@ invariant theArray.owner == this

    /*@ spec_public */ private int topOfStack;

    // (remaining methods omitted)
}


Figure 5: The object invariants, first method, and field declarations
of the annotated StackAr.java file [Wei99]. The ESC/Java
annotations (comments starting with “@”) are produced automatically
by Daikon, are automatically inserted into the source code by
our system, and are automatically verified by ESC/Java.

Figure 6: Breakdown of invariants detected by Daikon in the
RatPoly program. The invariants are divided into object invariants,
preconditions, and postconditions. The columns are the same
as the “Number of invariants” columns of Figure 4.


3.3 RatPoly: polynomial over rational numbers


A second example further illustrates our results, and pro- 
vides examples of verification problems. 


The RatPoly program is an implementation of rational-coefficient
polynomials that support basic algebraic operations
[MIT01]. The source contains 498 non-comment lines
of code, in 3 classes and 42 methods. Informal comments 
state the representation invariant and method specifications. 
Our system produced an annotation set that was close to 
a verifiable set. Additionally, the annotation set reflected 
some properties of the programmer’s specification, which 
was given by informal comments. 


Figure 6 shows that Daikon reported 123 invariants over 
the class; 10 of those did not verify, and 5 more had to be 
added. 


The unverifiable invariants were all true, but other miss- 
ing invariants prevented them from being verified. For in- 
stance, the RatPoly implementation maintains an object 
invariant that no zero-value coefficients are ever explicitly 
stored, so Daikon reported that a get method never returns 
zero. However, since elements of Java collection classes 
may not be accessed in ESC/Java annotations, the object in- 
variant is not expressible and the get method failed to ver- 
ify. Similarly, the mul operation exits immediately if one of 
the polynomials is undefined, but the determination of this 
condition also required annotations accessing Java collec- 
tions. Thus, ESC/Java could not prove that helper methods 
used by mul never operated on undefined coefficients, as 
reported by Daikon. 


The invariants that had to be added were of two cate- 
gories. Some were due to a specification language mismatch 
between ESC/Java and Daikon. Daikon uses consistent no- 
tation to state the runtime type of elements in a sequence, 
whether it is an array or a Java collection class; ESC/Java 
expresses the two in unrelated notations. In our experiments, 
these properties had to be translated by hand, but automating 
this step is straightforward. The rest of the missing invariants 
were detected by Daikon, but suppressed for lack of statis- 
tical justification. Providing a more extensive test suite, or 
improving Daikon’s statistical measures, would correct this 
problem. 


3.4 Other experiments 


We also performed eight other experiments, as shown in Fig- 
ure 4. The results were positive and ranged from complete 
success as for StackAr to the occasional problems as out- 
lined for RatPoly. The average precision (a measure of 
correctness) and recall (a measure of completeness) were 
0.95 and 0.94, respectively. 

Unverifiable invariants were either test suite artifacts or 
lacked supporting invariants. Test suite artifacts arise when 
the test suite maintains a property, even though that prop- 
erty is not generally true. These problems often indicate a 
deficiency in testing, but did not arise frequently in these ex- 
periments (see Section 4.4). 

Unverifiable invariants more commonly occur when supporting
invariants are outside the scope of the tools. For instance,
in the RatNum class, Daikon found that the negate
method preserves the denominator and negates the numera- 
tor. However, verifying that property would require detect- 
ing and verifying that the gcd operation called by the con- 
structor has no effect because the numerator and denomina- 
tor of the argument are relatively prime. 

Missing invariants that could have reasonably been ex- 
pected to be detected can also lead to failed verification. For 
example, the QueueAr class guarantees that unused stor- 
age is set to null. The representation invariants that maintain 
this property were missing from Daikon’s output, because 
they were conditioned on a predicate more complicated than 
Daikon currently attempts. This omission prevented verifi- 
cation of many method postconditions. 

Redundant invariants—those implied by other invari- 
ants — are often unhelpful to the user because they convey 
no new information. For instance, in the DisjSets class, 
Daikon reported that the union method ensured a certain 
property over all elements, but also reported the same prop- 
erty for various subsets of the elements. Redundant invari- 
ants may occasionally highlight important conclusions not 
obvious to the programmer, such as when a conclusion de- 
pends on invariants from several other objects in the system. 
However, in general redundant invariants are not useful, and 
we plan to improve Daikon’s redundancy checks (see Sec- 
tion 7). 


4 Limitations 


This section discusses limitations of automatic generation 
and checking of program specifications. These limitations 
fall into three general categories: problems with the tools, 
problems with the target programs, and problems with the 
test suites for the target programs. 


4.1 ESC/Java 


ESC’s input language is a variant of the Java Modeling
Language (JML) [LBR99, LBR00], an interface specification
language that specifies the behavior of Java modules. We
use “ESCJML” for the JML variant accepted as input by
ESC/Java.

Limitations of ESCJML prevent certain properties from 
being expressed. As a result, these properties must be omit- 
ted from the generated specifications, even though Daikon 
reports them as true over a program’s test suite. ESCJML 
annotations cannot include method calls, even ones that are 
side-effect-free. Daikon uses these for obtaining Vector 
elements and as predicates in implications. Unlike Daikon, 
ESCJML cannot express closure operations, such as all the 
elements in a linked list. 

ESCJML requires that object invariants hold at entry to 
and exit from all methods, so it warned that the object invari- 
ants Daikon reported were violated by private helper meth- 
ods. We worked around this problem by inlining one such 
method from the QueueAr program. 

ESCJML cannot express invariants over strings, although 
Daikon reports few such invariants in any event. As a re- 
sult, ESC/Java cannot verify that object invariants hold at 
the exit from a constructor or other method that interprets a 
string argument, even though it can show that the invariant 
is maintained by other methods. 

The full JML language permits method calls in assertions, 
\reach() for expressing reachability via transitive clo- 
sure, and specifies that object invariants hold only at entry 
to and exit from public methods. 

Some of this functionality might be missing from 
ESC/Java because it is designed not for proving general pro- 
gram properties but as a lightweight method for verifying 
absence of runtime errors. However, our investigations re- 
vealed examples where such verification required each of 
these missing capabilities. In some cases, ESC/Java users 
may be able to restructure their code to work around these 
problems. In others, users can insert unsound pragmas 
that cause ESC/Java to assume particular properties with- 
out proof, permitting it to complete verification despite its 
limitations. We did not use any such pragmas in our experi- 
ments. 


4.2 Daikon 


A limitation of automatic generation of specifications in- 
volves invariants that Daikon does not detect — missing 
classes of invariants. Section 3.4 discussed problems with 
a negate method for rational numbers; a possible solution 
is to detect when numbers are relatively prime. We had pre- 
viously rejected that invariant as of insufficiently general ap- 
plicability. 

Figure 7: Comparison of program size to test suite size, given in
non-comment, non-blank lines of code. “NCNB” is size of the
program; “Original” is the size of its original, accompanying test
suite; “Added” is the number of lines added to yield the results
described in Section 3. “Sys” indicates a system test not
specifically focused on the program (see Section 4.4).

Compared with previously published work, the version
of Daikon used in this experiment incorporates several
improvements essential to generating verifiable specifications.
Of most interest, Daikon’s conditioning predicates were
enhanced to include boolean procedure return values and
procedure exit points. Daikon uses these predicates to produce
implications and disjunctions, which are critical to
specifying methods that take different actions depending on
internal state.
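
The role of a conditioning predicate can be sketched as follows. This is a minimal illustration, not Daikon's implementation: the class `ConditionalInvariantSketch`, the `Sample` record of observations, and the single hard-coded candidate property `topOfStack == -1` are all assumptions made for the example. Samples taken at the exit of a hypothetical `isEmpty()` method are split on the boolean return value, and the candidate is tested on each side separately.

```java
// Illustrative only: tests one candidate property on all samples and on the
// subsets selected by a boolean conditioning predicate (the return value).
class ConditionalInvariantSketch {
    // One observation at the exit of a hypothetical isEmpty() method.
    static final class Sample {
        final boolean returned;
        final int topOfStack;

        Sample(boolean returned, int topOfStack) {
            this.returned = returned;
            this.topOfStack = topOfStack;
        }
    }

    // Does "topOfStack == -1" hold on every sample? (False if no samples.)
    static boolean holdsUnconditionally(Sample[] samples) {
        for (Sample s : samples) {
            if (s.topOfStack != -1) {
                return false;
            }
        }
        return samples.length > 0;
    }

    // Does "topOfStack == -1" hold on every sample with the given return
    // value? A subset with no samples gives no support, so nothing is reported.
    static boolean holdsWhenReturned(Sample[] samples, boolean returned) {
        boolean supported = false;
        for (Sample s : samples) {
            if (s.returned != returned) {
                continue;
            }
            if (s.topOfStack != -1) {
                return false;
            }
            supported = true;
        }
        return supported;
    }
}
```

When the property fails over all samples but holds on the subset where the method returned true, the detector can report the implication "return == true implies topOfStack == -1" rather than reporting nothing.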


4.3 Target programs 


Another challenge to verification of invariants is the like- 
lihood that programs contain errors that falsify the desired 
invariant. (Although it was never our goal, we have pre- 
viously identified such errors in textbooks [Gri81, Wei99], 
in programs used in testing research [HFGO94, RH98], and 
elsewhere.) As an example of a likely error that we detected 
in the course of this project, one of the object invariants for 
StackAr states that unused elements of the stack are null. 
The pop operations maintain this invariant (which approximately
doubles the size of their code), but the makeEmpty
operation does not. We noticed this when the expected ob- 
ject invariant was not inferred, and we corrected the error in 
our version of StackAr. 
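
A minimal sketch of this defect and its repair, assuming a simplified stack modeled loosely on StackAr. The class name `StackArSketch`, the `buggy` flag, and the `repOk` checker are illustrative inventions and not the textbook code.

```java
// Simplified stand-in for StackAr; not the textbook implementation.
class StackArSketch {
    private final Object[] theArray;
    private int topOfStack = -1;
    private final boolean buggy; // true mimics a makeEmpty that only resets the index

    StackArSketch(int capacity, boolean buggy) {
        this.theArray = new Object[capacity];
        this.buggy = buggy;
    }

    void push(Object x) {
        theArray[++topOfStack] = x;
    }

    // pop nulls out the vacated slot, preserving the object invariant.
    Object pop() {
        Object result = theArray[topOfStack];
        theArray[topOfStack--] = null;
        return result;
    }

    void makeEmpty() {
        if (!buggy) {
            // Corrected version: clear every used slot before resetting the index.
            for (int i = 0; i <= topOfStack; i++) {
                theArray[i] = null;
            }
        }
        topOfStack = -1;
    }

    // Object invariant: every slot above topOfStack is null.
    boolean repOk() {
        for (int i = topOfStack + 1; i < theArray.length; i++) {
            if (theArray[i] != null) {
                return false;
            }
        }
        return true;
    }
}
```

Any test that calls `makeEmpty` on a non-empty stack falsifies the candidate invariant for the buggy variant, which is consistent with the expected invariant not being inferred.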


4.4 Test suites 


A final challenge to generation is deficient or missing test 
suites. If the executions provided by a test suite are not 
characteristic of a program’s general behavior, properties ob- 
served during testing may not generalize. However, one of 
the key results of this research is that even limited test suites 
can capture certain semantics of a program. 

Figure 7 shows relative sizes of test suites and programs 
used in this experiment. Test suites for the smaller programs 
were larger in comparison, but no test suite was unreason- 
ably sized. 

System tests (tests that check end-to-end behavior of a
system) tended to produce good invariants immediately,
confirming earlier experiences [ECGN01]. These system
tests were for a system containing the module we examined, 
rather than being just for the module itself. 


Unit tests (tests that check specific boundary values of
procedures in a single module in isolation) were not
immediately successful. When the initial test suites were unit
tests that came from the textbooks or were used for grad- 
ing, they often contained just three or four calls per method. 
Some methods on StreetNumberSet were not tested at 
all. 

We corrected these test suites, but did not attempt to make 
them minimal. The corrections were not difficult. When 
failed ESC/Java verification attempts indicate a test suite is 
deficient, the unverifiable invariants specify the unintended 
property, so a programmer knows exactly how to improve 
the tests. For example, the original tests for the div
operation on RatPoly exercised a wide range of positive
coefficients, but all tests with negative coefficients used a
numerator of -1. Other examples included certain stack operations
not being performed on a full stack, calls to a safe stack pop 
operation always being protected by a check whether the ar- 
ray was empty, and a queue implemented via an array not be- 
ing forced to wrap around. These properties were detected 
and reported as unverifiable by our system, and extending 
the tests to cover additional values was effortless. 
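
The wrap-around case can be made concrete with a small sketch. The class `QueueArSketch` below is a hypothetical circular-array queue, not the QueueAr program itself, and the candidate property "front <= back whenever the queue is non-empty" is chosen only to show how a narrow test suite lets an artifact survive. The queue records whether the candidate has held in every state observed after an operation.

```java
// Hypothetical circular-array queue that tracks one candidate invariant.
class QueueArSketch {
    private final Object[] theArray;
    private int front = 0;
    private int back = -1;
    private int currentSize = 0;
    // True while "front <= back" has held in every non-empty state observed.
    private boolean frontLeBackAlways = true;

    QueueArSketch(int capacity) {
        theArray = new Object[capacity];
    }

    void enqueue(Object x) {
        if (currentSize == theArray.length) {
            throw new IllegalStateException("queue is full");
        }
        back = (back + 1) % theArray.length; // wraps to 0 past the last slot
        theArray[back] = x;
        currentSize++;
        observe();
    }

    Object dequeue() {
        if (currentSize == 0) {
            throw new IllegalStateException("queue is empty");
        }
        Object result = theArray[front];
        theArray[front] = null;
        front = (front + 1) % theArray.length;
        currentSize--;
        observe();
        return result;
    }

    // Falsify the candidate as soon as one counterexample state is seen.
    private void observe() {
        if (currentSize > 0 && front > back) {
            frontLeBackAlways = false;
        }
    }

    boolean candidateSurvived() {
        return frontLeBackAlways;
    }
}
```

A test that never fills the array leaves the candidate standing, and a static checker then reports it as unverifiable; adding one test that enqueues past the end of the array removes it.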

Test suites are an important part of any programming ef- 
fort, so time invested in their improvement is not wasted. 
In our experience, the additional effort (if any) required to 
obtain accurate invariants is indistinguishable from that re- 
quired to create a general test suite. In short, poor ver- 
ification results indicate specific failures in testing, and 
reasonably-sized test suites are able to accurately capture se- 
mantics of a program. 


5 Discussion 


The most surprising result of our research is that specifica- 
tions generated from program executions are reasonably ac- 
curate: they form a set that is (nearly) self-consistent and 
self-sufficient, as measured by verifiability by an automatic 
specification checking tool. This result was not at all obvious 
a priori. One might expect that dynamically detected invari- 
ants would suffer from serious unsoundness by expressing 
artifacts of the test suite and would fail to capture enough of 
the (formal) semantics of the program. 

This positive result implies that dynamic invariant detec- 
tion is effective, at least in our domain of investigation. A 
second, broader conclusion is that executions over relatively 
small test suites capture a significant amount of informa- 
tion about program semantics. This detected information is 
equivalent to that resulting from, and verifiable by, a static 
analysis. Although we do not yet have a theoretical model 
to explain this, nor can we predict for a given test suite how 
much of a program’s semantic space it will explore, we have 
presented a datapoint from a set of experiments to explicate 
the phenomenon and suggest that it may generalize. 

We speculate that three factors may contribute to our
success. First, our specification generation technique does not
attempt to report all properties that happen to be true
during a test run. Rather, it produces partial specifications that
intentionally omit properties that are unlikely to be of use 
or that are unlikely to be universally true. It uses statisti- 
cal, algorithmic, and heuristic tests to make this judgment. 
Second, the information that ESC/Java needs for verifica- 
tion may be particularly easy to obtain via a dynamic anal- 
ysis. ESC/Java’s requirements are modest: it does not need 
full formal specifications of all aspects of program
behavior. However, it does require some specifications and
input-output relations, and we were able to verify detected
properties that were not strictly necessary for ESC’s checking,
but provided additional information about program behav- 
ior. Third, our test suites were of acceptable quality. Unit 
tests are inappropriate, for they produce very poor invariants. 
However, Daikon’s output makes it extremely easy to im- 
prove the test suites by indicating exactly what is wrong with 
them. Furthermore, existing system tests were adequate, and 
these are more likely to exist and often easier to produce. 


Our results suggest a new metric for test suite quality, 
which we call “value coverage” [Ham87, CR99] or “spec- 
ification coverage.” Specifications are closer than code cov- 
erage is to the abstract, semantic level at which programs 
are often understood. Software engineers may more readily 
interpret program properties than specific paths through the 
program, even if they would eventually equate the two. We 
are unsure whether specification-complete (or specification- 
verifiable) test suites — that is, test suites from whose exe- 
cutions complete or verifiable specifications can be dynam- 
ically extracted — are good for catching bugs, and whether 
they tend to be coverage-complete. We would like to further 
investigate these topics. 


We do know that dynamically detected program invariants 
make it easy to construct and extend test suites to achieve 
specification completeness. There is substantial anecdotal 
evidence that they also assist in detection of bugs. For
example, in addition to the StackAr problem noted in
Section 4.3, our experiments also revealed a bug in the Vector
class from JDK 1.1.8. The toString method throws
an exception for vectors with null elements. Our original
(code coverage complete) test suite did not reveal this bug,
but Daikon reported that the vector elements were always
non-null on entry to toString, leading to discovery of the
bug. The bug is corrected in JDK 1.3.


The goal of producing program specifications is so impor- 
tant that it is worthwhile to consider many approaches. Our 
research suggests that a novel approach can complement ex- 
isting ones: generate the specification unsoundly, then check 
it, resulting in a specification and a verification of its cor- 
rectness. We believe that unsound specifications can also be 
used to advantage in other situations: this can expand the 
applicability and utility of specifications and provide many 
of the benefits of sound specifications, in more situations. 
Even if full input-output relations are hard to generate
automatically, universally true properties (especially conditional
invariants) that characterize the relation are a step in the right
direction. 


5.1 Benefits of integration 


Static and dynamic analyses have complementary strengths 
and weaknesses, so combining them has great promise: dy- 
namic analysis can propose program properties to be verified 
by static analysis. Integrating dynamic invariant detection 
with static verification has benefits for both tools. 

Use of a static verifier to augment dynamic invariant de- 
tection overcomes a potential objection about possibly un- 
sound output, classifies the output (as proven true or poten- 
tially incorrect) to permit programmers to use it more effec- 
tively, permits verified invariants to be used in contexts (such 
as input to certain programs) that demand sound input, and 
may improve the performance or output of dynamic invariant 
detection. As a result, more programmers can take advan- 
tage of dynamically detected invariants in a variety of con- 
texts. This may eventually lead to fewer bugs (by introduc- 
ing fewer and detecting more), better documentation, less 
time wasted on program understanding, better test suites, 
more effective validation of program changes, and more ef- 
ficient programs. 

Use of dynamically detected invariants can bootstrap 
static verification by providing initial program annotations, 
goals, and intermediate assertions. Few programmers enjoy 
or are good at annotating programs, a time-consuming, te- 
dious, and error-prone task. This automation may speed the 
adoption of static analysis tools by lessening the user burden, 
even if some work still remains for the user. Dynamically 
detected invariants can also check and refine existing spec- 
ifications and indicate properties programmers might other- 
wise have overlooked. These improvements could lead to 
prevention and to earlier detection of errors, aiding in the 
production of more robust, reliable, and correct computer 
systems. 


6 Related work 


This is the first research we are aware of that has dynam- 
ically generated, then statically verified, program specifica- 
tions, or has used such information to investigate the amount 
of information about program semantics available in test 
runs. The two component techniques are well-known, how- 
ever. 

Dynamic analysis has been used for a variety of tasks; 
for instance, inductive logic programming (ILP) [Qui90, 
Coh94] produces a set of Horn clauses (first-order if-then 
rules) and can be run over program traces [BG93], though 
with limited success. Programming by example [CHK+93]
is similar but requires close human guidance, and version
spaces can compactly represent sets of hypotheses [Mit78,
Hir91, LDW00]. Value profiling [CFE97, SS98, CFE99]
can efficiently detect certain simple properties at runtime. 


Event traces can generate finite state machines that explicate 
potential system organization or behavior [BG97, CW98a, 
CW98b]. Program spectra [AFMS96, RBDL97, HRWY98, 
Bal99] also capture aspects of system runtime behavior. 
None of these other techniques has been as successful as 
Daikon for detecting invariants in programs, though many 
have been valuable in other domains. 

Many static inference techniques also exist, including ab- 
stract interpretation (often implemented by symbolic exe- 
cution or dataflow analysis), model checking, and theorem 
proving. (Space prohibits a complete review here.) A sound, 
conservative static analysis reports properties that are true 
for any program run, and theoretically can detect all sound 
invariants if run to convergence [CC77]. Static analyses omit 
properties that are true but uncomputable and properties of 
the program context. To control time and space complex- 
ity (especially the cost of modeling program states) and en- 
sure termination, they make approximations that introduce 
inaccuracies, weakening their results. For instance, accu- 
rate and efficient alias analysis is still beyond the state of 
the art [CWZ90, LR92, WL95], though for specific appli- 
cations, contexts, or assumptions, efficient pointer analyses 
can be sufficiently accurate [Das00]. 

There are many other tools besides ESC/Java for statically 
checking specifications [Pfe92, DC94, EGHT94, Det96, 
Eva96, NCOD97, LN98]. These other systems have differ- 
ent strengths and weaknesses than ESC/Java, but few have 
the polish of its integration with a real programming lan- 
guage. 

An independent project [JvH+98, HJv01] verified an
object invariant in Java’s Vector class, using automatic
translation to PVS [ORS92, ORSVH95], user-specified goals, and
some user interaction with PVS. 


6.1 Houdini 


The research most closely related to our integrated
system is Houdini, an annotation assistant for ESC/Java
[FL01, FJL01]. (A similar system was proposed by
Rintanen [Rin00].) Houdini is motivated by the observation that
users are reluctant to annotate their programs with invari- 
ants; it attempts to lessen the burden by providing an initial 
set. Houdini takes a candidate annotation set as input and 
computes the greatest subset of it that is valid for a particu- 
lar program. It repeatedly invokes the checker and removes 
refuted annotations, until no more annotations are refuted. 
The candidate invariants are all possible arithmetic
comparisons among fields (and “interesting constants” such as -1,
0, 1, array lengths, and null); many elements of this initial
set are mutually contradictory. 
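
The refutation loop described above can be sketched as follows. This is an assumption-laden illustration and not Houdini's code: `CheckerSketch` is a hypothetical stand-in for the static checker, annotations are represented as plain strings, and `toyChecker` is an invented checker in which an annotation is refuted either because it is false or because an annotation its proof relies on is no longer assumed.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a static checker: given the annotations currently
// assumed, report which of them it cannot verify.
interface CheckerSketch {
    Set<String> refuted(Set<String> assumed);
}

class HoudiniLoopSketch {
    // Repeatedly invoke the checker and drop refuted annotations until a
    // round removes nothing (a fixed point).
    static Set<String> greatestValidSubset(Set<String> candidates, CheckerSketch checker) {
        Set<String> current = new HashSet<>(candidates);
        while (true) {
            Set<String> refuted = checker.refuted(current);
            if (!current.removeAll(refuted)) {
                return current; // nothing was removed in this round
            }
        }
    }

    // Toy checker: refutes annotations that are false outright, and
    // annotations whose single prerequisite is no longer assumed.
    static CheckerSketch toyChecker(Set<String> falseOnes, Map<String, String> dependsOn) {
        return assumed -> {
            Set<String> refuted = new HashSet<>();
            for (String annotation : assumed) {
                String prerequisite = dependsOn.get(annotation);
                boolean missingPrerequisite =
                        prerequisite != null && !assumed.contains(prerequisite);
                if (falseOnes.contains(annotation) || missingPrerequisite) {
                    refuted.add(annotation);
                }
            }
            return refuted;
        };
    }
}
```

In this toy model, removing one false annotation in the first round causes any annotation that depended on it to be removed in the next, so the loop may need several checker invocations before it stabilizes.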

Houdini has been used to find bugs in several programs. 
Over 30% of its guessed annotations are verified, and it tends 
to reduce the number of ESC/Java warnings by a factor of 2-5.
At present, Houdini may be more scalable than our
system. Houdini took 62 hours to run on a 36,000-line program.
Daikon has run in under an hour on several 10,000-line
programs. Because it currently operates offline in batch mode,
its memory requirements make Daikon unlikely to scale to 
significantly larger systems without re-engineering; such an 
effort is now underway. This is a limitation of the Daikon 
prototype, not of the technique of dynamic invariant detec- 
tion. 

Daikon’s candidate invariants are richer than those of 
Houdini; Daikon outputs implications and disjunctions, and 
its base invariants are also richer, including more compli- 
cated arithmetic and sequence operations. If even one re- 
quired invariant is missing, then Houdini eliminates all other 
invariants that depend on it. Houdini makes no attempt to 
eliminate implied (redundant) invariants, as Daikon does
(reducing its output size by an order of magnitude [ECGN00]),
so it is difficult to interpret numbers of invariants produced 
by Houdini. Finally, Houdini is not publicly available, so we 
cannot perform a direct comparison. 

Merging the two approaches could be very useful. For 
instance, Daikon’s output could form the input to Houdini, 
permitting Houdini to spend less time eliminating false
invariants. (A prototype “dynamic refuter”, essentially a
dynamic invariant detector, has been built [FL01], but no
details or results about it are provided.) Houdini has a different
intent than Daikon: Houdini does not try to produce a com- 
plete specification or annotations that are good for people, 
but only to make up for missing annotations and permit pro- 
grams to be less cluttered; in that respect, it is similar to 
type inference. However, Daikon’s output could perhaps be 
used in place of Houdini’s. Invariants that are true but de- 
pend on missing invariants or are not verifiable by ESC/Java 
would not be eliminated, so users might be closer to a com- 
pletely annotated program, though they might need to elim- 
inate some invariants by hand. 


7 Future work 


Section 4 listed a number of limitations of our system (and 
its components Daikon and ESC/Java) that should be cor- 
rected. We would also like to investigate what test suites 
lead to good specifications, as noted in Section 5. 

Another obvious way to extend this work is to use dif- 
ferent invariant detectors than Daikon or different verifiers 
than ESC/Java. Section 6 lists some other invariant de- 
tectors. Examples of static verifiers that are connected 
with real programming languages include LCLint [EGHT94,
Eva96, Eva00], ACL2 [KM97], LOOP [JvH+98], Java
PathFinder [HP00], and Bandera [CDH+00].

We are currently integrating Daikon with IOA [GLV97,
GL00], a formal language for describing computational
processes that are modeled using I/O automata [Lyn96, LT87,
LT89]. The IOA toolset (http://theory.lcs.mit.edu/tds/ioa.html)
permits IOA programs to be run and also
provides an interface to the Larch Prover [GG90, GG91,
SAGG+93], an interactive theorem-proving system for
multisorted first-order logic. Daikon will propose goals,
lemmas, or intermediate assertions for the theorem prover.
Side conditions such as representation invariants can enable 
proofs that hold in all reachable states or representations (but 
not in all possible states or representations). It can be tedious 
and error-prone for people to specify the properties to be 
proved, and current systems have trouble postulating them; 
some researchers consider that task harder than performing 
the proof [Weg74, BLS96, BBM97]. Our preliminary
experiments have resulted in the automatic detection of invariants
used in a published proof [GL00].

We are also interested in recovering from failed attempts 
at static verification. Broadly speaking, verification fails be- 
cause the goal properties are too strong or too weak. Prop- 
erties that are too strong may be true but beyond the capa- 
bilities of the verifier, or may not be universally true (for 
instance, guaranteed by the program context or artifacts of 
the test suite). Properties that are too weak are true, but can- 
not be proved by the static verifier or are not useful to it — 
for instance, loop invariants may need to be strengthened 
to be proved. We anticipate that dynamic invariant detection 
will propose more overly-strong invariants than overly-weak 
ones. When verification fails, we would like to know how 
to strengthen and weaken invariants in a principled way, by 
examining the source code, program executions, patterns of 
invariants, and verifier output, to increase the likelihood of 
successful verification. 

While dynamic invariant detection has been successful in 
several application domains, we believe that truly success- 
ful program analysis requires both static and dynamic com- 
ponents. Some of the properties that are difficult to obtain 
from a dynamic analysis are apparent from an examination 
of the source code, and properties that are beyond the state 
of the art in static analysis can be easily checked at runtime. 
We plan to integrate more static analysis into our system 
(and particularly into Daikon). For example, the dynamic 
analysis need not check properties discovered by the static 
analysis, and the dynamic analysis can focus on statically 
indicated code. 


8 Conclusion 


We have proposed, implemented, and experimentally as- 
sessed a novel approach to producing correct specifications: 
generate them unsoundly from program executions, then 
verify them. To our knowledge, ours is the first system to 
dynamically detect and then statically verify program speci- 
fications. 

Our experiments indicate that even limited test suites ac- 
curately characterize general execution properties: they can 
generate a consistent and sufficient set of specifications that 
can be automatically verified with little or no change. This 
surprising result suggests that runtime properties may not 
be as unreliable as general opinion holds, given an effective 
method for extracting them. We do not yet have a
principled description of the static characteristics of a test suite
that result in a high-quality generated specification, but even 
simple system tests seem to be sufficient. 

Our experiments also demonstrate the effectiveness of dy- 
namic invariant detection, and of the Daikon implementa- 
tion. More specifically, in our tests, it generated specifica- 
tions with high (about 95%) precision and recall, when mea- 
sured against the task of static verification by ESC/Java. This 
validates the approach of producing invariants from program 
executions. 

The results generally justify the use of unsound tech- 
niques in appropriate ways in program development and sug- 
gest that these may be extended to program specifications, 
which have traditionally required complete correctness. We 
also found that integrating static and dynamic techniques in 
our system produces benefits in each direction, because of 
their complementary strengths and weaknesses. Finally, dy- 
namically generated specifications may assist in bug detec- 
tion and prove to be a valuable measure of test suite quality. 